ABSTRACT


In educational applications, Knowledge Tracing (KT) has been widely studied for decades, as it is considered a fundamental task towards adaptive online learning. Among the proposed KT methods, Deep Knowledge Tracing (DKT) and its variants are by far the most effective ones due to the high flexibility of neural networks. However, DKT often ignores the inherent differences between students (e.g. memory skills, reasoning skills), averaging the performance of all students. This leads to a lack of personalization, and DKT has therefore been considered insufficient for adaptive learning. To alleviate this problem, in this paper we propose Leveled Attentive KNowledge TrAcing (LANA), which first uses a novel student-related features extractor (SRFE) and pivot modules to distill and distinguish students’ unique inherent properties from their respective interactive sequences. Moreover, inspired by Item Response Theory (IRT), the interpretable Rasch model is used to cluster students by their ability levels, so that leveled learning can assign different encoders to different groups of students. With the pivot module reconstructing the decoder for individual students and leveled learning specializing encoders for groups, personalized DKT is achieved. Extensive experiments conducted on two real-world large-scale datasets demonstrate that the proposed LANA improves the AUC score by at least 1.00% (i.e. EdNet ↑ 1.46% and RAIEd2020 ↑ 1.00%), substantially surpassing the other State-Of-The-Art KT methods.

1. INTRODUCTION


Knowledge Tracing (KT) aims to accurately retrieve a student’s knowledge state at a certain time from his past sequential exercising interactions. To evaluate KT’s performance, the model is asked to predict the correctness of the student’s future exercises with the retrieved knowledge states, as represented in Equation 1:

$$ P(r_{t+1}^{s_i} \mid e_{t+1}^{s_i,q_j}, c^{q_j}, \{I_1^{s_i}, \ldots, I_t^{s_i}\}), \quad I_t^{s_i} = (e_t^{s_i,q_j}, c^{q_j}, r_t^{s_i}), \tag{1} $$

where $e_t^{s_i,q_j}$ refers to student $s_i \in \mathbb{N}^+$ answering question $q_j \in \mathbb{N}^+$ at discrete time step $t \in \mathbb{N}^+$, $c^{q_j}$ represents the contextual information of question $q_j$ (e.g. related concepts, part, etc.) [23, 14, 10, 4], and $r_t^{s_i} \in \{0, 1\}$ represents the correctness of student $s_i$’s answer to $q_j$ at time $t$. Additionally, the student’s interaction sequence is defined as $S_{t_0,t_1}^{s_i} = \{I_t^{s_i} \mid t_0 < t < t_1\}$, and $\kappa$ is defined as $\kappa_t^{s_i} = \{s_i, q_j, c^{q_j}, r_t^{s_i}\}$, referring to all features that participate in one interaction $I_t^{s_i}$, for later explanation.


Traditionally, KT was regarded as a sequential behavior mining task [8, 17], and therefore various methods established models with the theory of Bayesian probability (BKT [3]) and psycho-statistics (IRT [5]), providing excellent interpretability and good performance. Nevertheless, the recently proposed Deep Knowledge Tracing (DKT) [16] and its variants [13, 4, 14, 1, 18] significantly outperform other KT methods in metrics by using Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM [6]). However, DKT distinctly lacks personalization for students compared to BKT and IRT [15, 25]: the latter are capable of training a separate model for each student, while DKT only trains a unified model for all students due to the massive training data and abundant computing resources required by deep learning. Hence, DKT only weakly reflects the large gaps in inherent properties (e.g. memory skills, reasoning skills, or even guessing skills) between students.


ASSUMPTION 1. For any interactive sequences satisfying $\|S_{t_0,t_1}^{s_i}\| > O \gg 1$, $\|\kappa\| > W$ and $t_2 - t_1 > E$, $S_{t_0,t_1}^{s_i}$ can be distinguished from $S_{t_0,t_1}^{s_j}$ and $S_{t_2,t_3}^{s_i}$ respectively.


Is it possible to bring personalization back to DKT? To answer this question, we observed that the proactive behavior sequence (i.e. interactive sequence) of each individual is unique and changes over time. Hence, we argue that the minimal personalized unit in KT is “a student at a certain time $t_i$” instead of just “a student”, and that a student’s inherent properties at time $t_i$ can be represented by his interactive sequences around time $t_i$ (Assumption 1). In this way, these student-related features could tremendously help personalize the KT process, since they can be used to identify different students at different stages. Consequently, in our proposed Leveled Attentive KNowledge TrAcing (LANA), unique student-related features are distilled from students’ interactive sequences by a Student-Related Features Extractor (SRFE). Moreover, inspired by BKT and IRT, which assign completely different models to different students, LANA, as a DKT model, achieves the same goal in a different manner. In detail, instead of separately training a model for each student like BKT and IRT, LANA learns to learn correlations between inputs and outputs with attention to the extracted student-related features, and thus becomes transformable for different students at different stages. More specifically, the transformation is accomplished using the pivot module and leveled learning, where the former is a model component that relies heavily on the SRFE, and the latter is a training mechanism that specializes encoders for groups with ability levels defined by the interpretable Rasch model. Formally, LANA can be represented by:


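$$ r_{t+1}^{s_i} \sim \overbrace{(f(p_t^{s_i}))(h_t^{s_i})}^{\text{Adaptive by Pivot Module}}, \quad \underbrace{p_t^{s_i} \sim k(h_t^{s_i}), \quad h_t^{s_i} \sim g(h_{t-1}^{s_i}, S_{0,t}^{s_i})}_{\text{Adaptive by Leveled Learning}}, \tag{2} $$

where $h_t^{s_i}$ and $p_t^{s_i}$ refer to student $s_i$’s knowledge state and student-related features at time $t$ respectively, and $f(\cdot)$ (decoder), $g(\cdot)$ (encoder) and $k(\cdot)$ (SRFE) are the three main modules that LANA seeks to learn.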


2. METHODOLOGY 
2.1 Base Modifications 


Figure 1: The overall model architecture of LANA. There are mainly three differences compared to the vanilla transformer-based KT methods [1, 18]: I. modifications to the basic transformer model; II. the introduced SRFE; and III. the introduced PMA module and PC-FFN module, which are collectively referred to as the pivot module.

There are mainly two base modifications in the LANA model (Figure 1) that were made to the basic transformer. Firstly, in the LANA model, the positional information (e.g. positional encoding, positional embedding) is fed directly into the attention module with a private linear projection, instead of being added to the input embedding and sharing the same linear projection matrix with other features in the input layer. Although experiments in [22] suggested that blending the input embedding with positional information is effective, some recent work [19] argued that when the model becomes deeper, it tends to “forget” the positional information fed into the first layer. Moreover, other work [9] argued that adding positional information to the input embedding and offering the sum to the attention module essentially makes them share the same linear projection matrix, which is not reasonable since the effects of the input embedding and the positional information are clearly distinct. For exactly the same reason, in the LANA model, multiple input embeddings (e.g. question ID embedding, student ID embedding, etc.) are concatenated instead of added, leading to the second base modification. Specifically, assume there are $m$ input embeddings in total, each with a dimension of $D^{I}$. After concatenation, the input embedding has a total dimension of $D^{mI}$ (i.e. $m$ times the dimension of a single embedding). Hence, a $D^{mI} \to D^{I}$ linear projection layer is used to map the concatenated input embedding of dimension $D^{mI}$ to dimension $D^{I}$.
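As a rough illustration of the second base modification, the sketch below concatenates $m$ feature embeddings and maps the result back to a single embedding size with one linear projection. It is a minimal pure-Python sketch rather than the authors’ implementation: the helper names (`linear`, `concat_and_project`), the sizes and the random weights are all hypothetical.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x (plain lists of floats)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def concat_and_project(embeddings, weight, bias):
    """Concatenate m embeddings of size d, then project m*d -> d."""
    concatenated = [v for emb in embeddings for v in emb]
    return linear(concatenated, weight, bias)


# Hypothetical sizes: m = 3 feature embeddings, each of dimension d = 4.
m, d = 3, 4
rng = random.Random(0)
embeddings = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(m)]
weight = [[rng.uniform(-1, 1) for _ in range(m * d)] for _ in range(d)]
bias = [0.0] * d

projected = concat_and_project(embeddings, weight, bias)  # a vector of length d
```

Because every feature occupies its own slice of the concatenated vector, the projection can weight each feature independently, which is not possible when the embeddings are summed before projection.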


2.2 Student-Related Features Extractor (SRFE) 
The Student-Related Features Extractor (SRFE) summarizes students’ inherent properties from their interactive sequences, relying on Assumption 1, so that the pivot module can personalize the parameters of the decoder. Specifically, the SRFE contains an attention layer and several linear layers, where the attention layer is used to distill student-related features from the information provided by the encoder, and the linear layers are leveraged to refine and reshape these features. Notably, the LANA model contains two SRFEs: the memory-SRFE and the performance-SRFE. The former is utilized to derive students’ memory-related features for the PMA module, and the latter is dedicated to distilling students’ performance-related features (e.g. logical thinking skill, reasoning skill, integration skill, etc.) for the PC-FFN module (both modules are introduced in Section 2.3). The reshaping process is drawn in Figure 3 for better illustration, where $bs$, $n_{heads}$, $seq$ and $d_{piv}$ refer to the model’s batch size, the number of attention heads [22], the length of the input sequence and the dimension of the performance-related features respectively. The intuition that memory-related features have a second dimension of $n_{heads}$ comes from the theory that each attention head only pays attention to one perspective of the features. Thus it is reasonable that each student has different memory skills for different attention heads (e.g. for different concepts).


Figure 2: The workflow of leveled learning: the interpretable Rasch model is leveraged to analyze students’ overall ability levels and then cluster students into multiple layers, where each layer fine-tunes the LANA model with its own training data.

Figure 3: The data shape transformation of the two SRFEs: Memory-SRFE and Performance-SRFE. (Diagram labels: Linear, Linear, Linear; [bs, seq, dim].)


2.3 Pivot Module


Provided an ordinary input $x$, student-related features $p$ and a target output $y$, the pivot module learns the process of learning how to project $x$ to $y$ based on $p$, instead of simply learning to project $x$ to $y$ (i.e. the pivot module learns to learn), as shown in Equation 3:

$$ y = (f(p))(x), \tag{3} $$

where $f(\cdot)$ is the function that the pivot module learns to learn. That is, the projection matrix of $x$ is adapted to $p$ instead of being fixed. To accomplish this dynamic mapping, the weight and bias applied to $x$ need to be projections of $p$. Assuming $p \in \mathbb{R}^{D_p}$, $x \in \mathbb{R}^{D_x}$ and $y \in \mathbb{R}^{D_y}$, Equation 3 can be formally presented as Equation 4:

$$ y = W^x x + b^x, \tag{4} $$

where $W^x \in \mathbb{R}^{D_y \times D_x}$ and $b^x \in \mathbb{R}^{D_y}$. Since $W^x$ and $b^x$ are derived from $p$, the detailed transformation is given in Equation 5, which is also depicted in Figure 4 for better illustration:

$$ W^x = W_1^p p + b_1^p, \quad b^x = W_2^p p + b_2^p, \tag{5} $$

where $W_1^p \in \mathbb{R}^{(D_y \times D_x) \times D_p}$, $b_1^p \in \mathbb{R}^{(D_y \times D_x)}$, $W_2^p \in \mathbb{R}^{D_y \times D_p}$ and $b_2^p \in \mathbb{R}^{D_y}$.

By simplification, Equation 3 can be defined as Equation 6, named PivotLinear$(x, p)$:

$$ y = (W p) x + b = \text{PivotLinear}(x, p), \tag{6} $$

where $W \in \mathbb{R}^{D_y \times D_x \times D_p}$ and $b \in \mathbb{R}^{D_y}$.
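Equation 6 can be read as an ordinary linear layer whose weight matrix is generated from the student-related features $p$. The following is a minimal pure-Python sketch of that reading, assuming $W$ is stored as a nested list indexed `[D_y][D_x][D_p]`; the function name `pivot_linear` and the toy numbers are illustrative and not taken from the authors’ code.

```python
def pivot_linear(x, p, W, b):
    """PivotLinear(x, p) = (W p) x + b, with W indexed as [D_y][D_x][D_p]."""
    # Contract W with p to obtain a student-specific weight matrix [D_y][D_x].
    Wx = [[sum(w * pk for w, pk in zip(W_ij, p)) for W_ij in W_i] for W_i in W]
    # Apply the generated matrix to x like an ordinary linear layer.
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(Wx, b)]


# Toy example with D_y = 1, D_x = 2, D_p = 2 (hypothetical numbers).
W = [[[1.0, 0.0], [0.0, 1.0]]]
b = [0.5]
x = [1.0, 1.0]

y_student_a = pivot_linear(x, [2.0, 3.0], W, b)  # weights generated from p = [2, 3]
y_student_b = pivot_linear(x, [1.0, 0.0], W, b)  # same input, different student features
```

The same input `x` yields different outputs for the two feature vectors, which is the sense in which the projection is adapted to $p$ rather than fixed.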


Figure 4: An illustration of the data transformation in the pivot module. (Diagram: $p \times W_1^p + b_1^p = W^x$; $p \times W_2^p + b_2^p = b^x$; $x \times W^x + b^x = y$.)

In the LANA model, there are primarily two modules that belong to the pivot module: the Pivot Memory Attention (PMA) module and the Pivot Classification Feed Forward Network (PC-FFN) module. In many methods [4, 14], a Vanilla Memory Attention (VMA) module is employed to model the “forgetting” behavior of students, which is pivotal in KT’s context: a student is very likely to have done exercises similar to the one he is going to do, and if he can remember the answers to those previous exercises, the probability of him correctly answering future related exercises increases greatly. Inspired by the Ebbinghaus Forgetting Curve [12] and much previous work [14, 4], the “forgetting” behavior of students is defined as exponentially decaying weights of the corresponding interactions along the timeline. In detail, in the original attention module, the weight of item $j$ on item $k$, i.e. $\alpha_{j,k}$, is determined by the sigmoid result of the similarity between item $j$ and item $k$:

$$ \alpha_{j,k} = \frac{\text{sim}(j, k)}{\sum_{k'} \text{sim}(j, k')}, \tag{7} $$

where $\text{sim}(\cdot)$ is a function that calculates the similarity between item $j$ and item $k$ by dot product. In order to take the “forgetting” behavior into account in $\alpha_{j,k}$ (i.e. the further away item $k$ is from item $j$, the lower the weight $\alpha_{j,k}$ should be), we replace Equation 7 with Equation 8:

$$ \alpha_{j,k,m} = \frac{e^{-(\theta + m) \cdot \text{dis}(j,k)} \cdot \text{sim}(j, k)}{\sum_{k'} \text{sim}(j, k')}, \tag{8} $$

where $m$ is the student’s memory-related features extracted by the memory-SRFE, $\theta$ is a private learnable constant that describes all students’ average memory skill in the PMA module, and $\text{dis}(\cdot)$ calculates the time distance between item $j$ and item $k$ (e.g. item $j$ is done $\text{dis}(j,k)$ minutes after item $k$ is done). The reason for representing the memory skill with two learnable parameters is to reduce the difficulty of model convergence, since $m$ has a much longer back-propagation path than $\theta$. When $\theta$ is introduced to fit the average memory skill of all students, the distribution of $m$ becomes a Gaussian distribution, which makes the model much easier to learn.
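The effect of the decay term in Equation 8 can be illustrated with a small pure-Python sketch. It assumes that the factor $e^{-(\theta + m) \cdot \text{dis}(j,k)}$ multiplies the normalized similarity of Equation 7 and that the similarities are positive; whether the decayed weights are renormalized afterwards is not modeled here. The function name `decayed_attention` and all numbers are hypothetical.

```python
import math


def decayed_attention(sims, distances, theta, m):
    """Attention weights of one query item j over past items k.

    sims[k]      -- similarity sim(j, k), assumed positive
    distances[k] -- time distance dis(j, k)
    theta        -- memory skill shared by all students
    m            -- this student's memory-related feature
    """
    total = sum(sims)
    return [math.exp(-(theta + m) * d) * s / total for s, d in zip(sims, distances)]


sims = [2.0, 2.0]        # two equally similar past items
distances = [0.0, 1.0]   # the second one was answered one time unit earlier
weights = decayed_attention(sims, distances, theta=1.0, m=0.0)
forgetful = decayed_attention(sims, distances, theta=1.0, m=0.5)
```

With equal similarities, the older item receives the smaller weight, and a larger `m` (a more forgetful student under this sign convention) shrinks it further.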


On the other hand, PC-FFN is utilized to make the final prediction with reference to the performance-related features; it is essentially a PivotLinear module with dropout and an activation. The idea of this module comes from many investigations showing that the early layers in a deep neural network often act as a feature extractor, while the later layers often act as a decision-maker that decides which features are useful to the output of the model. These investigations also point out that many models have similar early layers, and that it is the later layers that make these models distinctive in usage. Consequently, PC-FFN in the LANA model is utilized as a personalized decision-maker that adaptively makes the final prediction based on students’ distinctive inherent properties:

$$ \text{PC-FFN}(x, p) = x + \text{PivotLinear}(\text{PivotLinear}(x, p), p), \tag{9} $$

where $p$ is the student’s performance-related features extracted by the performance-SRFE.
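Equation 9 stacks two PivotLinear layers and adds a residual connection. The sketch below is one possible pure-Python reading of it. Placing a ReLU between the two layers is an assumption (the text mentions an activation without stating where it sits), dropout is omitted as it would be at inference time, and the names `pivot_linear`, `relu` and `pc_ffn` are illustrative.

```python
def pivot_linear(x, p, W, b):
    """PivotLinear(x, p) = (W p) x + b, with W indexed as [D_y][D_x][D_p]."""
    Wx = [[sum(w * pk for w, pk in zip(W_ij, p)) for W_ij in W_i] for W_i in W]
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(Wx, b)]


def relu(v):
    return [max(0.0, t) for t in v]


def pc_ffn(x, p, layer1, layer2, activation=relu):
    """x + PivotLinear(activation(PivotLinear(x, p)), p); dropout omitted."""
    (W1, b1), (W2, b2) = layer1, layer2
    hidden = activation(pivot_linear(x, p, W1, b1))
    out = pivot_linear(hidden, p, W2, b2)
    # Residual connection from Equation 9; the output size must equal len(x).
    return [xi + oi for xi, oi in zip(x, out)]


# Toy one-dimensional layers (hypothetical parameters).
layer1 = ([[[1.0]]], [0.0])
layer2 = ([[[2.0]]], [1.0])

out_a = pc_ffn([3.0], [0.5], layer1, layer2)
out_b = pc_ffn([3.0], [-0.5], layer1, layer2)
```

The two calls share the input and the parameters and differ only in the performance-related features, yet produce different predictions.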


2.4 Leveled Learning 

While the pivot module enables the decoder to be transformable for different students, the encoder and the SRFE of the LANA model, which provide the necessary information for the pivot module, remain the same for all students. This is not problematic if the input sequence is long enough, since Assumption 1 assures that long sequences are always distinguishable unless they belong to the same student in the same time period. However, DKT, especially transformer-based DKT, can only take the latest $n$ (commonly $n = 100$) interactions as input at once due to limited memory size and high computational complexity. Consequently, it is possible for the encoder and SRFE to output similar results for two different students, causing the decoder to fail to adapt. To alleviate this problem, it is natural to think of assigning different students different encoders and SRFEs that are highly specialized (sensitive) to their assigned students’ patterns. In practice, however, it is not feasible to train a unique encoder for each individual student, considering both the limited training time and the limited training data. As a result, a novel leveled learning method (Figure 2) is proposed to address this problem. It was initially inspired by the fine-tuning mechanism in transfer learning [20], where we consider each student a unique task and want to efficiently transfer a model that fits well on all students to one student $s_i$.


Leveled learning holds the view that the earlier layers of a model are similar for similar tasks. Thus, to save training time and enlarge the training set, instead of training a unique encoder and SRFE for each student with his private training data, students with similar ability levels are grouped together, sharing their private training data and having the same encoder and SRFE. Therefore, LANA first utilizes an interpretable Rasch model to analyze the ability level $a^{s_i}$ of each student $s_i$, and then groups students into different independent layers $l_i$. Assuming the ability distributions of all students and of the students at level $l_i$ are the Gaussian distributions $N(\mu_a, \sigma_a^2)$ and $N(\mu_i, \sigma_i^2)$ respectively, we have Equation 10:


pee Pe 6. =) o (10) 


L 


In LANA, for simplicity, we consider that all layers share the same variance $\sigma^2$¹ and that the difference of the mean $\mu_i$ between consecutive layers is a constant $\eta$. Hence, $\mu_i$ and $\sigma^2$ are given by:

2 


So xr tix, =. (11) 


[i = Pa — 


where $L = \|l_i\|$ is the number of layers. With both $\mu_i$ and $\sigma_i^2$ retrieved for every layer $l_i$, given a student’s ability constant $a^{s_i}$, we can now calculate the probability of $s_i$ being grouped into the different layers by Equation 12:

$$ p_i^{s_i} = \frac{\phi_i(a^{s_i})}{\sum_{i'} \phi_{i'}(a^{s_i})}, \quad \phi_i(a^{s_i}) = \frac{1}{\sigma_i \sqrt{2\pi}} e^{-\frac{(a^{s_i} - \mu_i)^2}{2\sigma_i^2}}, \tag{12} $$

where $p_i^{s_i}$ refers to the probability of student $s_i$ being grouped into layer $l_i$. As can be seen from Equation 12, students that have high ability levels are not necessarily grouped into layers with high expected ability levels $\mu_i$. Rather, these high-ability students only have a higher probability of being grouped into high-ability layers than low-ability students do, which agrees with reality (e.g. high-ability students may also come from ordinary schools).
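The soft assignment of Equation 12 amounts to evaluating one Gaussian density per layer and normalizing the results. A minimal pure-Python sketch follows, assuming a shared standard deviation for all layers; the evenly spaced layer means, the ability value and the function names are hypothetical inputs chosen only for illustration.

```python
import math


def gaussian_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at ability a."""
    return math.exp(-((a - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def layer_probabilities(ability, layer_means, sigma):
    """Probability of a student with the given ability belonging to each layer."""
    densities = [gaussian_pdf(ability, mu, sigma) for mu in layer_means]
    total = sum(densities)
    return [d / total for d in densities]


# Three hypothetical layers with evenly spaced means and a shared sigma.
layer_means = [-1.0, 0.0, 1.0]
probs = layer_probabilities(ability=0.9, layer_means=layer_means, sigma=1.0)
```

A student with ability 0.9 is most likely to be assigned to the highest layer, but the other layers keep a non-zero probability, matching the remark that high-ability students are not necessarily placed in high-ability layers.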


Then, the LANA model that has been pre-trained on all students is duplicated $L$ times, and each cloned model $m_i$ is assigned to a layer $l_i$ to be dedicatedly fine-tuned on $l_i$’s private training data by weighted back-propagation:

$$ loss_i = p_i \times loss(predict_i, target), \tag{13} $$

where $predict_i$ is the prediction of the model $m_i$.
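Equation 13 scales the loss of each layer’s model by the probability that the student belongs to that layer. The sketch below illustrates this weighting in pure Python. The choice of binary cross-entropy as the base loss is an assumption (the text only writes $loss(\cdot)$), and the function names are hypothetical.

```python
import math


def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for one prediction in (0, 1) and a 0/1 target."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1 - target) * math.log(1.0 - pred))


def leveled_loss(p_i, predictions, targets):
    """loss_i = p_i * loss(predict_i, target), averaged over the sequence."""
    base = sum(bce(q, t) for q, t in zip(predictions, targets)) / len(targets)
    return p_i * base


# A student who belongs to layer i with probability 0.5 (hypothetical numbers).
loss_i = leveled_loss(0.5, predictions=[0.5], targets=[1])
```

A student who almost certainly does not belong to a layer contributes almost nothing to that layer’s gradient, so each cloned model is fine-tuned mostly on students of its own ability level.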


While the training phase of leveled learning seems promising, its inference phase faces two problems. The first problem is how to make a prediction using multiple specialized models. In LANA, the prediction is made by top-$k$ model fusion. In detail, when student $s_i$’s future responses need to be predicted, LANA first computes $p_i$ and then feeds $s_i$’s interactive sequence to all models $m_i$ that satisfy $p_i \in \text{top-}k(p)$, where $k$ needs to be set manually to control the predicting time. Then, the outputs of these models are multiplied by sigmoid($p_i$) to form the final prediction. The workflow of leveled learning’s inference step is described in Equation 14:


$$ r_i = m_i(x) \times \frac{p_i}{\sum_{i'} p_{i'}}, \quad i' \in \{ i \mid p_i \in \text{top-}k(p) \}, \tag{14} $$
where $r_i$ is leveled learning’s final prediction and $x$ is the input of the model. This workflow seems similar to an ensemble, where multiple models are utilized to generate the final answer. Nonetheless, the weights of the models in LANA are probabilities that come from an interpretable Rasch model, so it is clear which model is dominant for $x$. Moreover, unlike in an ensemble, where the role of each model is ambiguous, in LANA every model has an explainable effect (e.g. $l_k$ is committed to high-ability students, and therefore a student with a large $p_k$ must be similar to those high-ability students in $l_k$), suggesting that leveled learning significantly outperforms ensembles in interpretability. A detailed comparison is shown in Table 1.

¹ In practice, if the number of layers is small, their variances need to be manually measured and tuned based on the targets. If the number of layers is large, multiple layers can be regarded as one layer, and therefore sharing the same variance for all layers should be fine.

Table 1: COMPARISON BETWEEN LEVELED LEARNING AND ENSEMBLE

                   Leveled Learning       Ensemble
Sub-set Select     Psycho-statistics      Random
Interpretability   Good                   Bad
Predicting Time    Controllable (top-k)   Uncontrollable

On the other hand, the second problem of leveled learning is how to compute $p_i$ for students that LANA has never met in training, namely the “cold start” problem [24]. In the vanilla KT context, we can only initialize newly arrived students’ ability levels to the average ability level of all students. In practice, however, we can estimate their ability levels more accurately by asking them to do a couple of sample exercises or by using their ranking at school.
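The top-$k$ fusion step of Equation 14 can be sketched as follows. This is a hedged pure-Python illustration, not the authors’ code: each specialized model is represented by a plain callable, only the $k$ most probable models are evaluated, and summing their weighted outputs into one prediction is an assumption about how the weighted terms are combined. The name `fuse_top_k` and the toy models are hypothetical.

```python
def fuse_top_k(layer_probs, models, x, k):
    """Weighted fusion of the k models whose layers are most probable for x."""
    # Indices of the k layers with the highest membership probability.
    top = sorted(range(len(layer_probs)), key=lambda i: layer_probs[i], reverse=True)[:k]
    # Renormalize the selected probabilities so that the weights sum to 1.
    norm = sum(layer_probs[i] for i in top)
    # Only the selected models are run, which bounds the predicting time by k.
    return sum(models[i](x) * layer_probs[i] / norm for i in top)


# Three toy "models" that return a constant correctness probability.
models = [lambda x: 0.2, lambda x: 0.6, lambda x: 1.0]
layer_probs = [0.1, 0.3, 0.6]

fused = fuse_top_k(layer_probs, models, x=None, k=2)
```

With `k=1` the prediction reduces to the single most probable model, and larger `k` trades predicting time for a smoother combination.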


3. EXPERIMENTS 


3.1 Experimental Setup 

In order to evaluate the effectiveness of the proposed LANA², we applied it to two real-world large-scale datasets in comparison with many other State-Of-The-Art (SOTA) KT methods. Specifically, EdNet [2] and RAIEd2020 [7] are employed in our experiments. EdNet is currently the largest publicly available benchmark dataset in the education domain, consisting of over 90,000,000 interactions and nearly 800,000 students. RAIEd2020 is a recently published real-world dataset of approximately the same size as EdNet, with nearly 100,000,000 interactions and 400,000 students. In particular, the average number of exercising interactions per student in RAIEd2020 is double that of EdNet. Moreover, six KT methods that had previously achieved SOTA performance take part in the comparison: DKT [16], DKVMN [26], SAKT [13], SAINT [1], SAINT+ [18] and AKT [4]. In terms of the basic experimental environment, all experiments were conducted with PyTorch³ 1.6 on a Linux server equipped with an Nvidia V100 GPU. For the hyper-parameter setup, the learning rate was set to 5e-4 with the AdamW [11] optimizer, the length of the input sequence was set to 100, the batch size was set to 256, and other detailed configurations are listed in our source code. The input features $\kappa$ in EdNet contain Question ID, Question part, Students’ responses, Time interval between two consecutive interactions and Elapsed time of an interaction, whereas in RAIEd2020 a new feature is additionally added to $\kappa$, which indicates Whether or not the student checked the correct answer to the previous question. Finally, the Area Under the receiver operating characteristic Curve (AUC) was used in our experiments as the performance metric, which has been widely used in many other KT-related proposals.

² https://github.com/Soptq/LANA-pytorch
³ https://pytorch.org/

Table 2: THE AUC COMPARISON OF DIFFERENT METHODS TESTED ON EDNET AND RAIED2020 DATASETS

Dataset     Model         AUC
EdNet       DKT           0.7638_r
EdNet       DKVMN         0.7668_r
EdNet       SAKT          0.7663_r
EdNet       SAINT         0.7816
EdNet       SAINT+        0.7913
EdNet       SAINT+ & BM   0.7935
EdNet       LANA          0.8059
RAIEd2020   SAKT          0.7832
RAIEd2020   AKT           0.7901
RAIEd2020   SAINT+        0.7956
RAIEd2020   SAINT+ & BM   0.7991
RAIEd2020   LANA          0.8056

Table 3: INVESTIGATION OF THE EFFECTIVENESS OF DIFFERENT IMPROVEMENTS IN LANA

(PMA and PC_FFN together form the Pivot Module.)

Dataset BM PMA PC_FFN LL AUC Boost
EdNet 0.7913 -
EdNet ✓ 0.7935 ↑ 0.0022
EdNet ✓ 0.7997 ↑ 0.0084
EdNet ✓ 0.7923 ↑ 0.0010
EdNet ✓ 0.7933 ↑ 0.0020
EdNet ✓ ✓ 0.8029 ↑ 0.0116
EdNet ✓ ✓ 0.8015 ↑ 0.0102
EdNet ✓ ✓ ✓ 0.8038 ↑ 0.0125
EdNet ✓ ✓ ✓ 0.8050 ↑ 0.0137
EdNet ✓ ✓ ✓ ✓ 0.8059 ↑ 0.0146
RAIEd2020 0.7956 -
RAIEd2020 ✓ 0.7991 ↑ 0.0035
RAIEd2020 ✓ 0.8020 ↑ 0.0064
RAIEd2020 ✓ 0.7965 ↑ 0.0009
RAIEd2020 ✓ 0.7977 ↑ 0.0021
RAIEd2020 ✓ ✓ 0.8031 ↑ 0.0075
RAIEd2020 ✓ ✓ 0.8027 ↑ 0.0071
RAIEd2020 ✓ ✓ ✓ 0.8035 ↑ 0.0079
RAIEd2020 ✓ ✓ ✓ 0.8051 ↑ 0.0095
RAIEd2020 ✓ ✓ ✓ ✓ 0.8056 ↑ 0.0100
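For readers unfamiliar with the metric, AUC equals the probability that a randomly chosen correct response receives a higher predicted score than a randomly chosen incorrect one, with ties counted as one half. The following is a small pure-Python sketch of this pairwise definition; the function name `auc` and the toy data are illustrative, and the quadratic-time loop is only suitable for small examples, not for datasets of the size used here.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each (positive, negative) pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Four toy responses: 1 = answered correctly, 0 = answered incorrectly.
toy_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy data, three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75.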


For ease of explanation, hereinafter Base Modification (Section 2.1), Pivot Module (Section 2.3) and Leveled Learning (Section 2.4) are abbreviated as BM, PM and LL respectively.


3.2 Results And Analysis 


The overall experimental results of different KT methods on different datasets are presented in Table 2. Because we successfully reproduced the performance of SAINT and SAINT+ previously reported in the SAINT+ paper [18] (with considerable precision), the AUCs of the other models are directly cited from that paper (labeled with subscript r).


From the comparison table, it can be seen that on both the EdNet and RAIEd2020 datasets, LANA (marked bold) outperforms the previous SOTA method (marked italic) by 1.46% and 1.00% in absolute AUC respectively, verifying the effectiveness of our proposed improvements. Moreover, LANA also surpasses SAINT+ & BM by 1.24% and 0.65% respectively, suggesting that adaptability contributes most to LANA’s AUC increment. Considering that the experimented datasets are by far the two largest knowledge tracing datasets in the world, these results provide strong evidence of the validity of the proposed LANA method.

Figure 5: The visualization of intermediate features in SAINT+ (a) and in LANA (b). Compared to (a), students in (b) (different colors) are notably clustered (marked with arrows). The learning process of student #3 over time in SAINT+ (c) and in LANA (d). Compared to (c), a clear learning path appears in (d). (Panel title: Life Process of student #3.)


3.3 Ablation Studies 


In this section, we investigate the effectiveness of each of our proposed improvements: BM, which customizes the basic transformer architecture; PM, which enables the decoder to be adaptive to students’ personal characteristics; and LL, which interpretably specializes encoders and SRFEs for better predicting performance. The results of the ablation study are shown in Table 3.


The table shows that on EdNet, applying BM alone is already capable of improving the predicting AUC by approximately 0.2%, verifying the importance of both the treatment of positional embedding and the personalized linear projection for each input feature in KT’s context. Meanwhile, applying LL alone benefits the model performance as well, by about 0.2% compared to 0.1% with the vanilla ensemble. Considering that without PM, LL would just perform fitting on students with different ability levels, the performance gain from LL alone can be interpreted as a reduction of the gaps in students’ inherent properties. Moreover, BM + PM drastically boosts the model performance by nearly 1.25%, suggesting that PM makes proper use of the student-related features extracted by the SRFE to adaptively reparameterize the model’s decoder for different students at different stages, and therefore contributes most to the final performance gain. Finally, by combining all improvements together, BM + PM + LL (i.e. LANA) achieves a final AUC of 0.8059, outperforming the previous SOTA by 1.46%.


3.4 Features Visualization 

To vividly illustrate the validity of the student-related feature distillation in LANA, 20 students’ intermediate features from the PC-FFN module were sampled to generate Figure 5 by t-SNE [21]. In Figure 5 (a) and (b), each sample represents the intermediate features of a student, with different colors for different students, in SAINT+ and LANA respectively. It can be seen that in SAINT+, samples are almost randomly distributed, indicating that the correlation between samples of the same student is no more significant than that between samples of different students, because students’ personalities are ignored. On the other hand, in LANA, clusters of samples (marked with arrows) have notably appeared in comparison to (a). Thus, we conclude that LANA is capable of extracting student-related features from students’ interactive sequences and summarizing their similarities and differences, which eventually results in more distinguishable features for the final classifier.


Furthermore, we individually visualized the samples of student #3 (randomly picked) along the time axis to investigate the transition pattern of features in Figure 5 (c) (SAINT+) and (d) (LANA). In (c), there is no clear pattern in the change of features over time, while in (d), a clear transition path can be noticed. Since many other students share the same pattern in LANA, we argue that it represents the trajectory of the student’s ability changing with more and more exercising. Namely, it is the learning path of the student. Consequently, we contend that it is potentially helpful for other applications, such as learning stages transfer and learning path recommendation.


4. CONCLUSION 


In this paper, we proposed a novel Leveled Attentive KNowledge TrAcing (LANA) method that is committed to bringing adaptability back to DKT. Instead of directly learning the model parameters of different students, LANA distills students’ inherent properties from their respective interactive sequences with a novel SRFE, and learns a function to reparameterize the model with these extracted student-related features. To this end, an innovative pivot module was proposed to produce an adaptive decoder. Besides, a novel leveled learning training mechanism was introduced to cluster students by ability levels defined by the interpretable Rasch model, which not only specializes the encoder and therefore enhances the significance of students’ latent features, but also saves much training time. Extensive experiments on the two largest public benchmark datasets in the education domain strongly support the feasibility and effectiveness of the proposed LANA. Feature visualization also suggests extra impacts of LANA, be it learning stages transfer or learning path recommendation.